Using Intelligent Prefetching to Reduce the Energy Consumption of a Large-scale Storage System
Many high performance large-scale storage systems will experience significant workload increases as their user base and content availability grow over time. The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) center hosts one such system, which has recently undergone a period of rapid growth as its user population grew nearly 400% in about three years. When administrators of these massive storage systems face the challenge of meeting the demands of an ever-increasing number of requests, the easiest solution is to integrate more advanced hardware into existing systems. However, additional investment in hardware may significantly increase the system cost as well as daily power consumption. In this paper, we present evidence that well-selected software-level optimization is capable of achieving comparable levels of performance without the cost and power consumption overhead caused by physically expanding the system. Specifically, we develop intelligent prefetching algorithms that are suitable for the unique workloads and user behaviors of the world's largest satellite image distribution system, managed by USGS EROS. Our experimental results, derived from real-world traces with over five million requests sent by users around the globe, show that the EROS hybrid storage system could maintain the same performance with over 30% energy savings by utilizing our proposed prefetching algorithms, compared to the alternative solution of doubling the size of the current FTP server farm.
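The abstract does not spell out the prefetching algorithms themselves, but the general idea of trace-driven prefetching into a fast tier can be sketched with a minimal popularity-based prefetcher. The class name, the cache model, and the scene-ID vocabulary below are hypothetical illustrations, not the paper's actual design:

```python
from collections import Counter

class PopularityPrefetcher:
    """Toy sketch of trace-driven prefetching: keep the most-requested
    items staged in a fast tier so that slower (power-hungry) storage
    is touched less often. Names and policy are illustrative only."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.requests = Counter()   # request frequency from the access trace
        self.cache = set()          # contents of the fast tier

    def record(self, scene_id):
        # Observe one request from the workload trace.
        self.requests[scene_id] += 1

    def prefetch(self):
        # Stage the currently hottest scenes into the fast tier.
        hottest = [s for s, _ in self.requests.most_common(self.cache_size)]
        self.cache = set(hottest)
        return hottest

    def hit(self, scene_id):
        # A hit means the request is served without waking cold storage.
        return scene_id in self.cache
```

Replaying a request trace through `record()` and periodically calling `prefetch()` mimics, in miniature, how a popularity-aware policy converts workload knowledge into fewer accesses to the slow tier.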
HP-DAEMON: High Performance Distributed Adaptive Energy-efficient Matrix-multiplicatiON
The demand for improving the energy efficiency of high performance scientific applications has become critical. Software-controlled hardware solutions based on Dynamic Voltage and Frequency Scaling (DVFS) have been shown to be broadly effective. Although DVFS is beneficial to green computing, DVFS itself can incur non-negligible overhead if it issues a large number of frequency switches. In this paper, we propose a strategy to achieve optimal energy savings for distributed matrix multiplication by algorithmically trading more computation and communication at a time, adaptively and within user-specified memory costs, for fewer DVFS switches; this saves 7.5% more energy on average than a classic strategy. Moreover, we leverage a high performance communication scheme that fully exploits network bandwidth via pipeline broadcast. Overall, the integrated approach achieves substantial energy savings (up to 51.4%) and performance gains (28.6% on average) compared to ScaLAPACK pdgemm() on a cluster with an Ethernet switch, and outperforms ScaLAPACK and DPLASMA pdgemm() by 33.3% and 32.7% on average, respectively, on a cluster with an InfiniBand switch.
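The core trade-off described above (processing more data per step in exchange for fewer DVFS switches) can be illustrated with a toy cost model. The function name and the assumption of a fixed number of switches per step (e.g. scale down for communication, back up for computation) are hypothetical, not taken from the paper:

```python
import math

def dvfs_switches(n, block, switches_per_step=2):
    """Toy cost model: a blocked distributed GEMM over n columns,
    processed `block` columns at a time, runs ceil(n / block) steps.
    If each step issues a fixed number of DVFS frequency switches,
    larger blocks (more memory) mean fewer total switches."""
    steps = math.ceil(n / block)
    return steps * switches_per_step

# Doubling the block size, at the price of larger buffers,
# halves the number of steps and hence the DVFS switch overhead.
```

This is the shape of the trade: memory cost grows with the block size while switch overhead shrinks inversely with it, so an adaptive strategy can pick the largest block the user-specified memory budget allows.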
Network Binarization via Contrastive Learning
Neural network binarization accelerates deep models by quantizing their
weights and activations into 1-bit. However, there is still a huge performance
gap between Binary Neural Networks (BNNs) and their full-precision (FP)
counterparts. As the quantization error caused by weights binarization has been
reduced in earlier works, the activations binarization becomes the major
obstacle for further improvement of the accuracy. BNN characterises a unique
and interesting structure, where the binary and latent FP activations exist in
the same forward pass (i.e., ).
To mitigate the information degradation caused by the binarization operation
from FP to binary activations, we establish a novel contrastive learning
framework while training BNNs through the lens of Mutual Information (MI)
maximization. MI is introduced as the metric to measure the information shared
between binary and FP activations, which assists binarization with contrastive
learning. Specifically, the representation ability of BNNs is greatly
strengthened by pulling together positive pairs of binary and FP activations
from the same input sample, and pushing apart negative pairs from different
samples (the number of negative pairs can be exponentially large). This
benefits the downstream tasks, not only classification but also segmentation
and depth estimation, etc. The experimental results show that our method can be
implemented as a pile-up module on existing state-of-the-art binarization
methods and can remarkably improve the performance over them on CIFAR-10/100
and ImageNet, in addition to great generalization ability on NYUD-v2. Comment: Accepted to ECCV 202
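The pull-positive/push-negative scheme described above is the shape of an InfoNCE-style contrastive objective, which lower-bounds the mutual information between the two views. The following pure-Python sketch is a generic InfoNCE loss over paired binary and FP activation vectors; the function names, temperature value, and cosine-similarity choice are illustrative assumptions, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two activation vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return num / den

def info_nce(binary_acts, fp_acts, temperature=0.1):
    """InfoNCE-style sketch: each binary activation is pulled toward the
    FP activation of the SAME sample (positive pair) and pushed away
    from the FP activations of OTHER samples (negative pairs), which
    maximizes a lower bound on their mutual information."""
    loss, n = 0.0, len(binary_acts)
    for i in range(n):
        logits = [cosine(binary_acts[i], fp_acts[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)   # -log softmax at the positive
    return loss / n
```

The loss is smallest when each binary activation is most similar to its own sample's FP activation, which is exactly the alignment the abstract describes.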
Lipschitz Continuity Retained Binary Neural Network
Relying on the premise that the performance of a binary neural network can be
largely restored with eliminated quantization error between full-precision
weight vectors and their corresponding binary vectors, existing works of
network binarization frequently adopt the idea of model robustness to reach
this objective. However, robustness remains an ill-defined concept without
solid theoretical support. In this work, we introduce Lipschitz continuity, a
well-defined functional property, as the rigorous criterion for defining model
robustness in BNNs. We then propose to retain the
Lipschitz continuity as a regularization term to improve the model robustness.
Particularly, while the popular Lipschitz-involved regularization methods often
collapse in BNN due to its extreme sparsity, we design the Retention Matrices
to approximate spectral norms of the targeted weight matrices, which can be
deployed as the approximation for the Lipschitz constant of BNNs without the
exact Lipschitz constant computation (NP-hard). Our experiments prove that our
BNN-specific regularization method can effectively strengthen the robustness of
BNNs (verified on ImageNet-C), achieving state-of-the-art performance on CIFAR
and ImageNet. Comment: Paper accepted to ECCV 202
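The standard way to make a spectral norm (which bounds a linear layer's Lipschitz constant) cheap to penalize is power iteration; the paper's Retention Matrices are a BNN-specific construction on top of this idea that the abstract does not detail. The sketch below shows only the generic power-iteration estimate, as a hypothetical stand-in:

```python
import math

def spectral_norm(mat, iters=50):
    """Power-iteration estimate of the largest singular value of `mat`
    (a list of rows). The spectral norm of a layer's weight matrix
    bounds that layer's Lipschitz constant, so penalizing it is a
    tractable surrogate for the NP-hard exact Lipschitz computation."""
    rows, cols = len(mat), len(mat[0])
    v = [1.0] * cols
    for _ in range(iters):
        # Alternate u = W v and v = W^T u, renormalizing each time.
        u = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
        v = [sum(mat[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        nv = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / nv for x in v]
    # ||W v|| for the converged top right-singular vector v.
    u = [sum(mat[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))
```

Adding such an estimate (per layer) as a regularization term keeps the product of layer-wise Lipschitz bounds small without ever computing the exact network-level constant.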
Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation
Methods for improving the efficiency of deep network training (i.e. the
resources required to achieve a given level of model quality) are of immediate
benefit to deep learning practitioners. Distillation is typically used to
compress models or improve model quality, but it's unclear if distillation
actually improves training efficiency. Can the quality improvements of
distillation be converted into training speed-ups, or do they simply increase
final model quality with no resource savings? We conducted a series of
experiments to investigate whether and how distillation can be used to
accelerate training using ResNet-50 trained on ImageNet and BERT trained on C4
with a masked language modeling objective and evaluated on GLUE, using common
enterprise hardware (8x NVIDIA A100). We found that distillation can speed up
training by up to 1.96x in ResNet-50 trained on ImageNet and up to 1.42x on
BERT when evaluated on GLUE. Furthermore, distillation for BERT yields optimal
results when it is only performed for the first 20-50% of training. We also
observed that training with distillation is almost always more efficient than
training without distillation, even when using the poorest-quality model as a
teacher, in both ResNet-50 and BERT. Finally, we found that it's possible to
gain the benefit of distilling from an ensemble of teacher models, which has
O(n) runtime cost, by randomly sampling a single teacher from the pool of
teacher models at each step, which has only an O(1) runtime cost. Taken
together, these results show that distillation can substantially improve
training efficiency in both image classification and language modeling, and
that a few simple optimizations to distillation protocols can further enhance
these efficiency improvements.
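The O(n)-to-O(1) ensemble trick described above amounts to sampling one teacher per step rather than querying all of them. A minimal framework-agnostic sketch, with hypothetical names (teachers are modeled as plain callables):

```python
import random

def distillation_targets(ensemble, batch, rng=random):
    """Sketch of the O(1) ensemble-distillation trick: instead of running
    every teacher and averaging their outputs (O(n) forward passes per
    step), sample ONE teacher per step. In expectation over steps, the
    student still sees the ensemble's target distribution."""
    teacher = rng.choice(ensemble)       # one forward pass, O(1) per step
    return [teacher(x) for x in batch]   # soft targets for this batch
```

Over many training steps, each teacher contributes equally often, so the average supervision matches the ensemble's while each individual step costs the same as single-teacher distillation.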
Improving write performance by enhancing internal parallelism of Solid State Drives
Most research on Solid State Drive (SSD) architectures relies on Flash Translation Layer (FTL) algorithms and wear-leveling; however, internal parallelism in SSDs has not been well explored. In this research, we propose a new strategy to improve SSD write performance by enhancing the internal parallelism inside SSDs. An SDRAM buffer is added to the design for buffering and scheduling write requests. Because the same logical block numbers may be translated to different physical numbers at different times by the FTL, the on-board SDRAM buffer is used to buffer requests below the FTL. When the buffer is full, the same amount of data is assigned to each storage package in the SSD to enhance internal parallelism. To accurately evaluate performance, we use both synthetic workloads and real-world applications in our experiments. We compare the enhanced internal parallelism scheme with the traditional LRU strategy, since it would be unfair to compare an SSD with a buffer against an SSD without one. The simulation results demonstrate that the write performance of our design is significantly improved compared with the LRU-cache strategy at the same buffer sizes.
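The flush step described above, where a full buffer is split evenly across flash packages so all channels work concurrently, can be sketched in a few lines. The round-robin assignment and the function name are illustrative assumptions; the abstract only states that each package receives the same amount of data:

```python
def flush_buffer(buffered_writes, num_packages):
    """Sketch of the parallelism-enhancing flush: when the SDRAM buffer
    fills, divide the buffered write requests evenly across the flash
    packages so every channel can program pages in parallel."""
    queues = [[] for _ in range(num_packages)]
    for i, req in enumerate(buffered_writes):
        queues[i % num_packages].append(req)   # round-robin assignment
    return queues
```

Because the FTL may map the same logical block to different physical locations over time, this balancing can only be done below the FTL, which is why the buffer sits at that level in the design.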
Real-time Monitoring for the Next Core-Collapse Supernova in JUNO
A core-collapse supernova (CCSN) is one of the most energetic astrophysical
events in the Universe. The early and prompt detection of neutrinos before
(pre-SN) and during the SN burst is a unique opportunity to realize the
multi-messenger observation of the CCSN events. In this work, we describe the
monitoring concept and present the sensitivity of the system to the pre-SN and
SN neutrinos at the Jiangmen Underground Neutrino Observatory (JUNO), which is
a 20 kton liquid scintillator detector under construction in South China. The
real-time monitoring system is designed with both the prompt monitors on the
electronic board and online monitors at the data acquisition stage, in order to
ensure both the alert speed and alert coverage of progenitor stars. By assuming
a false alert rate of 1 per year, this monitoring system can be sensitive to
pre-SN neutrinos up to a distance of about 1.6 (0.9) kpc and to SN neutrinos
up to about 370 (360) kpc for a progenitor mass of 30 solar masses, for the case
of normal (inverted) mass ordering. The pointing ability for the CCSN is
evaluated by using the accumulated event anisotropy of the inverse beta decay
interactions from pre-SN or SN neutrinos, which, along with the early alert,
can play important roles for the followup multi-messenger observations of the
next Galactic or nearby extragalactic CCSN. Comment: 24 pages, 9 figures
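The pointing method described above, accumulating the anisotropy of inverse-beta-decay events, reduces to averaging many noisy per-event direction vectors. A minimal sketch, with hypothetical names and no detector effects (the real analysis involves the positron-neutron displacement and statistical weighting the abstract does not detail):

```python
import math

def point_to_source(event_dirs):
    """Toy sketch of pointing from accumulated event anisotropy: each
    inverse beta decay carries a slight directional bias along the
    incoming neutrino direction, so the normalized average of many
    per-event direction vectors estimates the supernova direction."""
    n = len(event_dirs)
    # Component-wise mean of the 3D event direction vectors.
    mean = [sum(d[k] for d in event_dirs) / n for k in range(3)]
    norm = math.sqrt(sum(c * c for c in mean)) or 1.0
    return [c / norm for c in mean]
```

The per-event bias is weak, which is why the pointing power comes from accumulation: individual event directions scatter widely, but their mean converges toward the source direction as statistics grow.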